{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# Code to make small dogs table from all_dogs\n",
    "# (pd.read_csv('all_dogs.csv')\n",
    "#  .sort_values('popularity_all')\n",
    "#  .head(7)\n",
    "#  [['breed', 'grooming', 'food_cost', 'kids', 'size']]\n",
    "#  .to_csv('dogs.csv', index=False)\n",
    "# )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "dogs = pd.read_csv('dogs.csv', index_col='breed')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:pandas)=\n",
    "# Working with Dataframes Using pandas"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data scientists work with data stored in tables. This chapter introduces\n",
    "*dataframes*, one of the most widely used ways to represent data tables. We\n",
    "also introduce `pandas`, the standard Python package for working with\n",
    "dataframes. Here is an example of a dataframe that holds information about\n",
    "popular dog breeds:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>grooming</th>\n",
       "      <th>food_cost</th>\n",
       "      <th>kids</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>breed</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Labrador Retriever</th>\n",
       "      <td>weekly</td>\n",
       "      <td>466.0</td>\n",
       "      <td>high</td>\n",
       "      <td>medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>German Shepherd</th>\n",
       "      <td>weekly</td>\n",
       "      <td>466.0</td>\n",
       "      <td>medium</td>\n",
       "      <td>large</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Beagle</th>\n",
       "      <td>daily</td>\n",
       "      <td>324.0</td>\n",
       "      <td>high</td>\n",
       "      <td>small</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Golden Retriever</th>\n",
       "      <td>weekly</td>\n",
       "      <td>466.0</td>\n",
       "      <td>high</td>\n",
       "      <td>medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Yorkshire Terrier</th>\n",
       "      <td>daily</td>\n",
       "      <td>324.0</td>\n",
       "      <td>low</td>\n",
       "      <td>small</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Bulldog</th>\n",
       "      <td>weekly</td>\n",
       "      <td>466.0</td>\n",
       "      <td>medium</td>\n",
       "      <td>medium</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Boxer</th>\n",
       "      <td>weekly</td>\n",
       "      <td>466.0</td>\n",
       "      <td>high</td>\n",
       "      <td>medium</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   grooming  food_cost    kids    size\n",
       "breed                                                 \n",
       "Labrador Retriever   weekly      466.0    high  medium\n",
       "German Shepherd      weekly      466.0  medium   large\n",
       "Beagle                daily      324.0    high   small\n",
       "Golden Retriever     weekly      466.0    high  medium\n",
       "Yorkshire Terrier     daily      324.0     low   small\n",
       "Bulldog              weekly      466.0  medium  medium\n",
       "Boxer                weekly      466.0    high  medium"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dogs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a dataframe, each row represents a single record---in this case, a single\n",
    "dog breed. Each column represents a feature about the record---for example, the\n",
    "grooming column represents how often each dog breed needs to be groomed.\n",
    "\n",
    "Dataframes have labels for both columns and rows. For instance, this dataframe\n",
    "has a column labeled grooming and a row labeled German Shepherd. The\n",
    "columns and rows of a dataframe are ordered—we can refer to the Labrador\n",
    "Retriever row as the first row of the dataframe. \n",
    "\n",
    "Within a column, data have the same type. For instance, the the cost of food \n",
    "contains numbers, and the  size of the dog consists of categories. But data types can\n",
    "be different within a row.\n",
    "\n",
    "Because of these properties, dataframes enable all sorts of useful operations."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "Data scientists often find themselves working with people from\n",
    "different backgrounds who use different terms. For instance, computer\n",
    "scientists say that the columns of a dataframe represent *features* of the\n",
    "data, while statisticians call them *variables* instead.\n",
    "\n",
    "Other times, people use the same term to refer to slightly different\n",
    "ideas. *Data types* in a programming sense refers to how a computer stores\n",
    "data internally. For instance, the `size` column has a string data type in\n",
    "Python. But from a statistical point of view, the type of the `size` column is ordered categorical data (ordinal data). We talk more about this specific distinction in {ref}`ch:eda`.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this chapter, we introduce common dataframe operations using\n",
    "`pandas`. Data scientists use the `pandas` library when working with dataframes\n",
    "in Python. First, we explain the main objects that `pandas` provides: the\n",
    "`DataFrame` and `Series` classes. Then we show how to use `pandas` to\n",
    "perform common data manipulation tasks, like slicing, filtering, sorting,\n",
    "grouping, and joining."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "301397dfac98ba5645188aec337edeb5e3836fad86b22b9be8631e97bb683640"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
